RWTH OCR: A Large Vocabulary Optical Character Recognition System for Arabic Scripts
Authors
Abstract
We present a novel large-vocabulary OCR system, which implements a confidence- and margin-based discriminative training approach for model adaptation of an HMM-based recognition system to handle multiple fonts, different handwriting styles, and their variations. Most current HMM approaches are HTK-based systems which are maximum-likelihood (ML) trained and which try to adapt their models to different writing styles using writer adaptive training, unsupervised clustering, or additional writer-specific data. Here, discriminative training based on the Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) criteria is used instead. For model adaptation during decoding, an unsupervised confidence-based discriminative training within a two-pass decoding process is proposed. Additionally, we use neural-network-based features extracted by a hierarchical multi-layer-perceptron (MLP) network either in a hybrid MLP/HMM approach or to discriminatively retrain a Gaussian HMM system in a tandem approach. The proposed framework and methods are evaluated for closed-vocabulary isolated handwritten word recognition on the IfN/ENIT Arabic handwriting database, where the word-error-rate is decreased by more than 50% relative compared to an ML-trained baseline system. Preliminary results for large-vocabulary Arabic machine-printed text recognition tasks are presented on a novel publicly available newspaper database.
RWTH Aachen University, Human Language Technology and Pattern Recognition, Ahornstr. 55, D-52056 Aachen, Germany. Tel.: +49-241-80-21613, Fax: +49-241-80-22219, e-mail: @cs.rwth-aachen.de
Similar resources
Optical Character Recognition
Optical Character Recognition (OCR) is one of the challenging areas of pattern recognition. It gained popularity among the research community due to its vast application potentials. Extensive research has been done on OCR evidenced by a large number of research articles published in the literature during the last few decades. Most of the research works reported in this area are for Roman, Chine...
Kannada Character Recognition System A Review
Intensive research has been done on optical character recognition (OCR) and a large number of articles have been published on this topic during the last few decades. Many commercial OCR systems are now available in the market, but most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not a sufficient number of works on Indian language character recognition especial...
Probabilistic sequence models for image sequence processing and recognition
This PhD thesis investigates the image sequence labeling problems optical character recognition (OCR), object tracking, and automatic sign language recognition (ASLR). To address these problems we investigate which concepts and ideas can be adopted from speech recognition to these problems. For each of these tasks we propose an approach that is centered around the approaches known from speech r...
An Arabic optical character recognition system using recognition-based segmentation
Optical character recognition (OCR) systems improve human-machine interaction and are widely used in many areas. The recognition of cursive scripts is a difficult task as their segmentation suffers from serious problems. This paper proposes an Arabic OCR system, which uses a recognition-based segmentation technique to overcome the classical segmentation problems. A newly developed Arabic word segm...
Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR
Arabic script character recognition is a challenging task due to the complexity of the script and the huge number of ligatures. We present a method for the development of multilingual Arabic script OCR (Optical Character Recognition) and lexicon reduction for Arabic Script and its derivative languages. The objective of the proposed method is to overcome the large dataset Urdu and similar scripts by using...
Publication date: 2011